Abstract

Social networks constitute a new platform for information propagation, but the success of
propagation depends crucially on the choice of spreaders who initiate it. In this paper,
we remove edges in a network at random so that the network segments into isolated clusters. The
most important nodes in each cluster then form a group of influential spreaders, such that news
propagating from them leads to extensive coverage and minimal redundancy. The method
exploits the similarities between the pre-percolated state and the coverage of information
propagation in each social cluster to obtain a set of distributed and coordinated spreaders. Our
tests on the Facebook networks show that this method outperforms conventional methods based
on centrality. The suggested way of identifying influential spreaders thus sheds light on a new
paradigm of information propagation on social networks.
I. INTRODUCTION


The development of social networks has had a great impact on our lifestyle, from making friends
to dating, from working to shopping. They become more essential as we increase our
dependence on them to gather information. Compared with search engines, which are based
on isolated queries, collecting information by leveraging the individual specialties in
social networks leads us to useful websites from experts in disparate fields, and thus increases
both the quality and the diversity of the acquired information. By the same token,
influential individuals can also be used to spread information. The key to success is to
identify the most influential spreaders in the network. Nevertheless, this is difficult as there
are usually just a few users capable of propagating a piece of news to a large number of users [1].
For example, while socially significant users are rare in the Twitter network, their messages
and blogs can spread quickly throughout the whole network [2, 3].

Although social networks are powerful for propagating information, their application for
this purpose is limited, partially because a way to identify the optimal spreaders is absent.
Nevertheless, simple methods have been proposed. For instance, “degree centrality” suggests
that nodes with higher degree are more influential than the others [4]. On the other hand,
the location of a node in a network and the influence of its neighbors are also considered
important. For instance, a node with a small number of highly influential neighbors located
at the center of the network may be more influential than a node having a larger number
of less influential neighbors. Kitsak et al. [5] thus proposed a coarse-grained method that
uses the k-core decomposition to quantify the influence of a node, based on the assumption
that news initiated at nodes in higher shells is likely to spread more extensively. Some
distance-based global metrics such as betweenness [6] and closeness [7] have also been suggested
and can lead to extensive propagation, but due to their high computational complexity, they are
not practical for large-scale social networks. Other centralities such as LocalRank have also
been suggested [8].

The above simple but sub-optimal protocols have been applied to social media such
as QQ, BBS and Blog to find the key spreaders who can trigger the “tipping point” in
social marketing to promote commercial products. Specifically, if one can convince a set of
influential users to adopt a new product, one may induce a large cascade of purchases as
these initial buyers propagate their praise of the product along the network. Unlike
the aforementioned methods, which identify a set of independent spreaders according to their
centralities, our goal is to find a set of coordinated individuals such that their combined
impact is greatest, leading to much more extensive propagation of information. Nevertheless,
identifying the optimal group of spreaders is a computationally difficult task [9].

In this paper, we utilize the similarities [10, 11] between percolation and information
propagation to identify a group of influential spreaders. By removing edges at random
until percolation ceases, individual isolated clusters are formed. Due to the correspondence
between percolation and information transmission, the emergence of such clusters implies
that news can be effectively propagated within the clusters but not across the clusters.
Initiating news at the most influential user in each cluster is thus an effective way to
distribute the news within the cluster. Since this process is static and requires much less
computational power than the dynamic spreading of news, many percolated states can be
generated to give a more accurate result on the segmentation of social clusters as well as
their corresponding influential spreaders.

By testing our protocol on the Facebook and Enron email networks, we show that in addition
to having a lower computational complexity, our protocol outperforms other simple heuristics
based on local and global centrality in terms of the propagation coverage and coverage
redundancy of the selected spreaders. This is consistent with the old saying that the power of
a typical group exceeds that of a single most competent individual. Moreover, we find that the
average degree of the users selected by our method is lower, which implies a lower cost
in identifying the spreaders when compared to the other methods. We also identify the
different characteristics of spreaders who are most effective at promoting niche or popular
items in order to maximize the coverage. All these results lead to insights into the design of
viral marketing strategies and a new paradigm for information propagation.


II. RESULTS 

Spreading dynamics involving humans can be mainly classified into two
classes: one is the spreading of infectious diseases, which requires physical contact, and the
other is the spreading of information, including opinions and rumors, where physical contact
is not required [12]. Due to the similarity between epidemic and information spreading,
well-established epidemic models are widely used to describe the propagation of
information [13-17].

In particular, the susceptible-infected-recovered (SIR) model is one of the representatives.
Specifically, a susceptible person (S) in the model is analogous to an individual who is not
aware of the information. An infected person (I) is analogous to an individual who is
aware of the information and will pass it to his/her neighbors. A recovered person (R) is
analogous to an individual who loses his/her interest and will never pass the information
again. Newman [11] studied in detail the relation between the static properties of the SIR
model and the bond percolation phenomenon on networks and remarked that the SIR model
with transmissibility p is equivalent to a bond percolation model with bond occupation
probability p on the network (see Methods and Materials, Sec. IV E).

Our method is then devised in relation to the bond percolation model as follows. Given
an undirected network G(V, E), where V represents the set of nodes (i.e., users in social
networks) and E represents the set of edges (i.e., connections in terms of communication,
friendship or other kinds of interactions), all edges are first removed and each individual edge
is then recovered with a probability p, i.e. all links are removed when p = 0. As p increases
from p = 0, more links are recovered and clusters start to form and merge with each other.
We will call this state the pre-percolated state. For a network containing N nodes, a giant
component of size O(N) emerges only when p is larger than a critical threshold $p_c$; this
emergence is called percolation. In the context of information propagation, since an edge
between two nodes appears with a probability p, the value p can be considered as the
transmissibility of a piece of information from one node to another.

To find the influential group, we have to find the W most influential spreaders for a
given value of p. Assume that there are m percolation clusters after one realization of link
recovery, and denote by $S_i$ the size of cluster i, i = 1, 2, ..., m. We introduce a tunable
parameter L, which is usually equal to or larger than W. If L ≤ m, we choose the top-L
largest clusters and assign one score to the largest-degree node in each cluster. If several
nodes share the largest degree, we assign the score to a random one among them. If
m < L ≤ 2m, we first choose the largest-degree node in each cluster, and the remaining L - m
nodes are chosen to be those with the second largest degree from the top-(L - m) largest
clusters. If L > 2m, we choose the next-largest-degree nodes in each cluster following
the same selection rules. After t different trials of link recovery, all nodes are ranked
according to their scores in descending order and the W nodes with the highest scores
are suggested to be the set of initial spreaders. For the sake of simplicity, we set L = W;
we have tested and found that the results are not sensitive to L. The dependence on L is
shown in the Supplementary Information (SI), Sec. SV, Fig. S6.
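
The selection procedure described above can be sketched in a few lines of Python. This is only
an illustrative sketch under stated assumptions, not the implementation used for the reported
results: the function names are hypothetical, clusters are found with a union-find structure,
nodes are ranked by their degree in the original network, L is set equal to W, and clusters
that run out of nodes are simply skipped in later rounds.

```python
import random
from collections import defaultdict


def percolation_clusters(nodes, edges, p, rng):
    """Keep each edge with probability p and return the resulting clusters."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        if rng.random() < p:  # edge is recovered
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    clusters = defaultdict(list)
    for v in nodes:
        clusters[find(v)].append(v)
    return list(clusters.values())


def select_spreaders(nodes, edges, p, W, trials, seed=0):
    """Score nodes over `trials` realizations and return the W top-scored nodes."""
    rng = random.Random(seed)
    degree = defaultdict(int)  # assumption: degree in the original network
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    score = defaultdict(int)
    for _ in range(trials):
        clusters = percolation_clusters(nodes, edges, p, rng)
        clusters.sort(key=len, reverse=True)  # largest clusters first
        ranked = []
        for c in clusters:
            rng.shuffle(c)  # shuffle + stable sort = random tie-breaking
            ranked.append(sorted(c, key=lambda v: degree[v], reverse=True))
        picked, rank = 0, 0
        while picked < W:  # assumption: L = W
            progressed = False
            for c in ranked:  # round `rank`: (rank+1)-th largest degree per cluster
                if picked == W:
                    break
                if rank < len(c):
                    score[c[rank]] += 1
                    picked += 1
                    progressed = True
            if not progressed:
                break
            rank += 1
    return sorted(nodes, key=lambda v: score[v], reverse=True)[:W]
```

On a toy network made of two disconnected stars, calling `select_spreaders` with p = 1 and
W = 2 returns the two hubs, one per cluster.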

In other words, our suggested method draws an analogy with percolation to identify
individual social clusters in the network where news can be effectively propagated within the
clusters but not across the clusters. These isolated clusters in the pre-percolated state thus
have a direct correspondence to the propagation coverage when one spreads the news from
an initial spreader in each of the clusters. Our rationale is different from most other
methods, which usually identify a group of influential spreaders for the network as a whole. In
addition, such a set of well distributed spreaders also enjoys a reduced redundancy when
compared to a set of un-coordinated spreaders. These differences make our method unique
compared to the other methods.

A. Spreadability and coverage redundancy 

To quantify the performance of our method, we examine the spreadability, i.e. the
propagation coverage of news from a set of k selected spreaders, obtained by our method as well
as by other methods. We use the SIR model to mimic the spreading of news, and the spreadability
is defined as the ratio of recovered nodes to the total number of nodes (i.e., the size of
the outbreak divided by N). We remark that the transmissibility p adopted in the SIR model is
the same as the probability p used to recover edges to identify the clusters in the percolated
states. As a result, for a single spreader, the ultimate size of the SIR outbreak triggered by
this spreader is precisely the size of the percolation cluster that it belongs to. Likewise, the
ultimate size of the SIR outbreak triggered by a group of spreaders in distinct percolation
clusters is the sum of the sizes of the clusters that these nodes belong to. For example,
suppose we measure the coverage of three selected nodes on a network with N nodes, where the
first two nodes belong to the cluster $S_1$ and the third one is in the cluster $S_2$. For the
single nodes 1, 2 and 3, the coverage is respectively $S_1/N$, $S_1/N$ and $S_2/N$, while
for the whole group, the spreadability of the three nodes is $(S_1 + S_2)/N$.
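
The three-node example can be checked numerically with a short sketch. The helper name
`spreadability` and the concrete values (N = 10, cluster sizes 4 and 3) are illustrative
assumptions, not values from the experiments.

```python
def spreadability(spreaders, cluster_of, cluster_size, n_nodes):
    """Sum of the sizes of the distinct clusters hit by the spreaders, divided by N."""
    hit = {cluster_of[v] for v in spreaders}  # each cluster is counted once
    return sum(cluster_size[c] for c in hit) / n_nodes


# Hypothetical numbers: nodes 1 and 2 sit in cluster S1, node 3 in cluster S2.
cluster_of = {1: "S1", 2: "S1", 3: "S2"}
cluster_size = {"S1": 4, "S2": 3}
single = spreadability([1], cluster_of, cluster_size, 10)       # S1 / N
group = spreadability([1, 2, 3], cluster_of, cluster_size, 10)  # (S1 + S2) / N
```

Because nodes 1 and 2 share a cluster, adding node 2 to node 1 does not increase the coverage;
only node 3 contributes a new cluster.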

We first apply our method on the Facebook network with 59691 nodes. Figure 1a shows
the coverage obtained from 4000 initial spreaders chosen by our percolation method,
compared with a set of 4000 spreaders identified by three other methods, namely the degree
centrality, the k-shell decomposition and the betweenness centrality (see Methods and
Materials for the definition of each of these methods; comparisons with other centrality
measures can be found in Sec. SIII, Fig. S2 of the SI). The percolation method yields the highest
spreadability for an arbitrary transmissibility p. Figure 1b shows the degree distribution of
the 4000 spreaders identified by the percolation method.

When $p < p_c = 0.01$, the percolation method yields isolated clusters [18] (see Fig. 1d)
of similar size, and since the selected spreaders come from different clusters,
a wide range of degrees is found among the spreaders (see Fig. 1b). In this case, the
percolation method is more likely to choose high-degree nodes (see Fig. S5b in the SI, where
the red stars represent the degree distribution of the 4000 selected nodes when p = 0.008).
When $p > p_c$, the distribution becomes narrower as p increases (see the blue squares
in Fig. S5b of the SI). In this case, the percolation method prefers low-degree spreaders.
The average original degree (i.e. degree in the original network before edge removal) of the
4000 spreaders selected by the percolation method when $p < p_c$ is higher than that of the
nodes selected when $p > p_c$. This implies that if we want to promote and advertise a new
niche product which is difficult to get accepted, one can draw an analogy with the case of small
transmissibility p, where high-degree initial spreaders are preferred. On the other hand, for
popular items which are easily accepted, one can draw an analogy with the case of large p,
where low-degree initial spreaders are preferred.

We then examine the cost of identifying the initial spreaders. We assume that the direct
influence of a user is equal to the number of its nearest neighbors (i.e., its degree k), while
the difficulty of finding a user with degree k is proportional to $1/p(k)$, where $p(k)$ denotes
the fraction of nodes with degree k; the cost to find a spreader i is then assumed to be
$k_i/p(k_i)$. Figure 1c shows the dependence of the average cost to find the 4000 spreaders,
i.e. the mean of $k_i/p(k_i)$ over the selected spreaders, on the parameter p. The cost decreases
abruptly at the critical point $p_c$, indicating a phenomenon resembling a phase transition. It
means that when p increases just beyond $p_c$, the cost can be reduced substantially.
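
As a rough illustration of this cost measure, the sketch below computes the mean of
$k_i/p(k_i)$ for a set of selected nodes. It assumes that $p(k)$ is the empirical degree
distribution of the whole network; the function name `average_cost` and the toy degrees are
hypothetical.

```python
from collections import Counter


def average_cost(selected, degree):
    """Mean of k_i / p(k_i) over the selected nodes.

    `degree` maps every node of the network to its degree; p(k) is taken to be
    the fraction of nodes with degree k (an assumption of this sketch).
    """
    n = len(degree)
    p_k = {k: c / n for k, c in Counter(degree.values()).items()}
    return sum(degree[i] / p_k[degree[i]] for i in selected) / len(selected)


# Toy star network: one hub of degree 4 and four leaves of degree 1.
degree = {0: 4, 1: 1, 2: 1, 3: 1, 4: 1}
hub_cost = average_cost([0], degree)   # 4 / 0.2
leaf_cost = average_cost([1], degree)  # 1 / 0.8
```

High-degree nodes are both rare and heavily weighted, so a spreader set dominated by hubs is
much more costly under this measure than one made of low-degree nodes.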

Besides the spreadability and the cost, we also examine the redundancy in coverage,
which quantifies the efficiency of the propagation. Specifically, the redundancy of a node i is
defined as the number of initial spreaders who have the potential to infect node i. A method
is inefficient if the initially chosen spreaders pass the same information to the same group
more than once. Averaging the redundancy over all the infected nodes, we obtain the redundancy
of the set of initial spreaders. Figure 1e compares the spreading redundancy of our method
with the three other methods (comparisons with other centrality measures can be found
in Sec. SIII, Fig. S3 of the SI). The highest redundancy is found in the methods of k-shell
and degree centrality, followed by betweenness centrality. Our percolation method has the
lowest redundancy among the four methods, since the spreaders identified by this method
are usually located in different regions of the original network. We also checked the Enron
e-mail network and obtained results similar to those for the Facebook network (see Supplementary
Sec. II, Fig. S1).

TABLE I: The percentage of the different distributions of the four spreaders in a model
network with four communities. For instance, the four spreaders identified by our method
are found in four different communities in 93.9% of the realizations. The procedure to
construct the networks is shown in Methods and Materials. The results are obtained based
on 1000 realizations. The critical point of the network is $p_c \approx 0.2687 \pm 0.0152$.
We have set p = 0.28 in our percolation method. Columns (1) to (5) correspond to the
pictograms of the original table header, which are not reproduced here; column (1) is the case
of four different communities.

Type           (1)     (2)     (3)     (4)     (5)
Percolation    93.9    6.1     0       0       0
Max degree     11      54.4    14      19.5    1.1
K-shell        9.4     43.6    11.9    16.2    18.9
Betweenness    13.1    64.5    11.7    10.6    0.1
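
The redundancy measure defined in the text can be illustrated with a minimal sketch. It
assumes that, in a single percolation realization, a spreader "has the potential to infect"
exactly the nodes of its own cluster; the function name and the toy cluster sizes are
hypothetical.

```python
from collections import Counter


def coverage_redundancy(spreaders, cluster_of, cluster_size):
    """Average, over all reached nodes, of the number of spreaders in the node's cluster."""
    per_cluster = Counter(cluster_of[v] for v in spreaders)
    reached = sum(cluster_size[c] for c in per_cluster)
    # Every node of cluster c can be reached by per_cluster[c] spreaders.
    return sum(per_cluster[c] * cluster_size[c] for c in per_cluster) / reached


cluster_of = {1: "A", 2: "A", 3: "B", 4: "C"}
cluster_size = {"A": 4, "B": 3, "C": 5}
clumped = coverage_redundancy([1, 2, 3], cluster_of, cluster_size)  # (2*4 + 1*3) / 7
spread = coverage_redundancy([1, 3, 4], cluster_of, cluster_size)   # one spreader per cluster
```

A set with one spreader per cluster attains the minimum redundancy of 1, which is the
situation the percolation method aims for.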

To further examine the spreadability, we applied the four methods to identify four initial
spreaders on a generated network with four clear communities. As shown in Table I, the
four spreaders identified by the percolation method are very likely to be found in different
communities, with one spreader in each community. For the other methods, there are high
probabilities that all or some of the initial spreaders are in the same community. These
results are easy to understand as our method relies on the segmentation of the network
into isolated clusters to identify the spreaders. In the present case, the network is likely to
separate into the four communities and thus one spreader is found in each community.

Most of the other methods always lead to the same set of spreaders. In our percolation
method, different sets of spreaders may be generated from different realizations, especially
for large p. Figure 2(a) shows the number of initial spreaders which are common among
different realizations of the percolation method applied on the Facebook network. It is
clear that when p increases, the number of common spreaders decreases, indicating that the
solutions become more and more diverse. This result has practical significance: when some
of the initial spreaders are offline, we can use the next best candidate as a back-up
spreader without losing spreadability. On the other hand, Fig. 2(b) shows the entropy
of the obtained solution, i.e. the logarithm of the number of different identified spreaders.
Compared with the other three methods, the percolation method provides a higher flexibility in
the choice of spreaders.

FIG. 1: The performance of the percolation method on the Facebook network. (a) The
spreadability of the 4000 spreaders selected by the percolation method (blue solid line),
maximum degree (squares), k-shell (red dashed line) and betweenness centrality (dots). (b)
The degree distributions of the 4000 initial spreaders selected by the percolation method
given different transmissibility p. The color indicates the frequency; blue to red
corresponds to low to high. (c) The cost to identify such a group of spreaders. (d) The size
of the second giant component, $S_{2nd}$, after percolation with different transmissibility p.
The largest value of $S_{2nd}$ [19] is obtained at the critical point $p_c = 0.01$. (e) The
coverage redundancy of the four methods.

FIG. 2: The diversity of the spreaders identified by the percolation method on the
Facebook network. (a) The number of common nodes between different realizations of the
percolation method on the Facebook network. (b) The scale of solution space. We use a
logarithmic scale for better presentation. In each realization, we select 4000 nodes by
the percolation method.

To further examine the difference between our method and the other methods,
Fig. 3 shows the number of identified spreaders which are common between the percolation
method and the other methods (comparisons with some other methods are found in Sec.
SIII, Fig. S4 of the SI). The overlap between the percolation method and the degree centrality
method reaches the highest value (≈ 61.5%) at the critical point $p_c = 0.01$ and then sharply
decreases to less than 5% when p = 0.03. This is because when p increases, most of the
high-degree nodes are replaced by nodes with lower degree, and many different sets of
identified spreaders are generated from the different realizations, as we have discussed for
Fig. 2. What is more impressive is that by increasing the value of p, the cost can be sharply
decreased without losing spreadability or substantially increasing coverage redundancy (see
Figs. 1(c) and 1(e)).

We show in Fig. 4 the relation between the spreadability and the cost. The percolation
method is the most cost-effective method in terms of spreadability. Four cases are presented,
namely $p = 0.008 < p_c$, $p = 0.01 = p_c$, $p = 0.012$ and $p = 0.02 > p_c$. Clearly, with the
same cost, the percolation method leads to a higher spreadability than the methods of k-shell
and degree centrality. Although the cost of using betweenness is low, its spreadability is
very limited and becomes saturated at small cost. The percolation method has the highest
saturated value of spreadability.

FIG. 3: The number of common spreaders identified by the percolation method and the
other three methods on the Facebook network. We have set W = L = 500 for the
percolation method.

FIG. 4: The spreadability versus the cost to identify spreaders on the Facebook network,
for (a) p = 0.008, (b) p = 0.01, (c) p = 0.012 and (d) p = 0.02.
We selected 4000 nodes and set L = 4000 in the percolation method. Here we present a
typical case as an example. For a given p, we select the 4000 top-ranked nodes according to the
different methods. For example, to obtain the results of the percolation method (i.e. the blue
curve), we plot the total cost of the top-k (k = 1, 2, 3, ..., 4000) nodes as the x-axis value
and their corresponding spreadability (as a group) as the y-axis value.

III. DISCUSSION

As we can see, social networks constitute a new platform to propagate information. Unlike
the usual practice where the networks are used by uncoordinated individuals to share their
own messages, intended spreading of information can indeed be implemented via the networks.
To measure its performance, one can measure the coverage, the redundancy in propagation
and the cost in identifying appropriate initial spreaders. Yet these measures of performance
are largely dependent on the choice of users who start the propagation, and there is not
a single protocol which achieves optimality in all these dimensions. These difficulties of
identifying influential spreaders keep information propagation via social networks
in an immature stage.

To tackle the challenge, we draw an analogy between the percolation process and
information propagation to develop a protocol which gives rise to a low-cost, minimally
redundant set of initial spreaders leading to a large coverage. Our protocol was tested on
the Facebook network, where favorable results over all the tested centrality-based methods
were obtained. When compared to these conventional methods, which identify a set of
uncoordinated spreaders, the spreaders identified by our protocol are evenly distributed within
the network, which greatly increases the propagation coverage and reduces its redundancy.
Such coordination of spreaders is essential and can only be obtained using the suggested
percolation procedures.

The success of this method is not just a coincidence; it makes the best use of the
similarities between percolation and the process of information propagation. By removing
edges at random until percolation ceases, we identify individual isolated clusters where
news can be effectively propagated within the clusters but not across the clusters. Specific
spreaders at the center of these clusters are then identified to be the influential initial
spreaders in the original network. By initiating news propagation from this set of spreaders,
coverage is increased and redundancy is reduced compared to the conventional centrality
methods. Percolation is thus at the center of our propagation protocol instead of a mere
analogy.

The remaining question is practicality. As we have discussed, the computational
complexity of our protocol is O(|V|), which is a favorable characteristic for applications on
practical systems as its complexity scales linearly with the system size. Once the set of
important initial spreaders is identified, a coordinator just has to connect to these users and
pass the news to them, and information will then propagate quickly throughout the network. Of
course, a lot of details and practical difficulties are omitted in this simple description, but
our results have led to insights into a completely new paradigm of information
propagation. Further research along this line may revolutionize our way of spreading and
gathering information in the near future.

IV. METHODS AND MATERIALS 

A. Baseline methods 

To identify the most influential spreaders, various centrality measures have been proposed.
The first method with which we compare our result is degree centrality. Degree centrality
is a straightforward and efficient metric. It assumes that a node with more nearest neighbors
has a higher influence. However, node degree can only reflect its direct influence but not the
indirect influence triggered by its nearest neighbors. For example, a node of small degree,
but with a few highly influential neighbors, may be more influential than a node having a
larger number of less influential neighbors.

The second method we used for comparison is the k-shell decomposition. Recent research
shows that the location of a node in a network may play a more important role than its
degree. A node located in the center of the network is more influential than a node having
a larger number of less influential neighbors. Following this rationale, Kitsak et al. [5]
proposed a coarse-grained method that uses the k-core decomposition to quantify
the influence of a node, based on the assumption that nodes in the same shell have similar
influence, and nodes in higher-level shells are likely to infect more nodes.

In the last method, we employ global information to identify the influential spreaders.
Specifically, betweenness is one of the most popular geodesic-path-based ranking measures.
It is defined as the fraction of shortest paths between all node pairs that pass through the
node of interest. Betweenness is, in some sense, a measure of the influence of a node in
terms of its role in spreading information [20, 21]. For a network G = (V, E) with n = |V|
nodes and m = |E| edges, the betweenness centrality of node v, denoted by B(v), is [6, 22]

$$B(v) = \sum_{s \neq v \neq t \in V} \frac{g_{st}(v)}{g_{st}}, \qquad (1)$$

where $g_{st}$ is the number of shortest paths between nodes s and t, and $g_{st}(v)$ denotes
the number of shortest paths between nodes s and t which pass through node v.
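
For illustration, the betweenness of every node in a small unweighted, undirected network can
be computed with Brandes' algorithm, sketched below. Brandes' algorithm is chosen here for
convenience and is not necessarily the procedure used for the reported results; the sketch
also assumes that each unordered pair {s, t} is counted once, and the function name is
hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm; `adj` maps each node to the set of its neighbors."""
    B = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                B[w] += delta[w]
    # Each unordered pair was visited twice (once from s, once from t).
    return {v: b / 2 for v, b in B.items()}
```

On a three-node path 0-1-2, only the middle node lies on a shortest path between two other
nodes, so it is the only node with non-zero betweenness.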

B. Computational complexity 

Given a network G(V, E), there are four steps to find the W influential spreaders by
the percolation method. Firstly, all the edges are removed and then recovered with a
probability p; we then obtain a new network G'. The required computational complexity is
O(|E|). Secondly, we find the strongly connected components of G' using Tarjan’s algorithm
[23], which has a complexity of O(|V| + |E|). Thirdly, we select the node with the highest
degree in each of the L largest components and assign one score to the selected nodes. The
complexity of this procedure is O(L·|V|). Repeating the above three steps for different
realizations, we rank the nodes according to their scores in descending order, and the
top-W nodes are chosen to be the most influential spreaders. The different realizations of the
percolation process can be computed in parallel and the complexity of each realization is
O(|E| + |V| + |E| + L·|V|). Considering $\langle k \rangle = 2|E|/|V|$, the complexity is
$O[(\langle k \rangle + L + 1) \cdot |V|]$. Since $\langle k \rangle \ll |V|$ in real networks,
we have $O[(\langle k \rangle + L + 1) \cdot |V|] \sim O(|V|)$, i.e. the
complexity of our method grows linearly with system size.

C. Model networks with community structures 

There are three steps to generate a network with community structures. In our
experiment, we consider a network with 2000 nodes which has four communities, each of which
contains 500 nodes. First, we generate a random network of size 500 with node
degrees distributed as a power law with exponent 2.2 using the configuration model [24]. The
minimum degree is 1 and the maximum degree is $\sqrt{500} \approx 23$ [25]. Second, we repeat the
above procedure to independently generate the other three networks. Finally, for each pair
of sub-networks we randomly select a fraction q of node pairs and connect them.
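
The three construction steps can be sketched as follows. This is a simplified illustration
rather than the exact generator used in the experiments: the function names are hypothetical,
self-loops and repeated edges produced by the stub matching are simply discarded, and the
default value of q is an arbitrary placeholder because the value used is not stated here.

```python
import random


def powerlaw_degrees(n, gamma, k_min, k_max, rng):
    """Draw n degrees with P(k) proportional to k**(-gamma) on [k_min, k_max]."""
    ks = list(range(k_min, k_max + 1))
    weights = [k ** -gamma for k in ks]
    degs = rng.choices(ks, weights=weights, k=n)
    if sum(degs) % 2:  # the number of stubs must be even
        degs[0] += 1 if degs[0] < k_max else -1
    return degs


def configuration_model(degs, offset, rng):
    """Match stubs at random; node labels start at `offset`."""
    stubs = [offset + i for i, k in enumerate(degs) for _ in range(k)]
    rng.shuffle(stubs)
    edges = set()
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:  # assumption: drop self-loops; the set drops multi-edges
            edges.add((min(u, v), max(u, v)))
    return edges


def community_network(n_comm=4, size=500, gamma=2.2, k_min=1, k_max=23,
                      q=0.0005, seed=0):
    """Return the edge set of a network with n_comm power-law communities."""
    rng = random.Random(seed)
    edges = set()
    for c in range(n_comm):  # steps 1 and 2: independent sub-networks
        degs = powerlaw_degrees(size, gamma, k_min, k_max, rng)
        edges |= configuration_model(degs, c * size, rng)
    for a in range(n_comm):  # step 3: link each pair of sub-networks
        for b in range(a + 1, n_comm):
            for _ in range(round(q * size * size)):
                u = a * size + rng.randrange(size)
                v = b * size + rng.randrange(size)
                edges.add((u, v))
    return edges
```

With the placeholder q = 0.0005, about 125 links are drawn between each pair of communities,
which keeps the four communities clearly separated; larger q blurs the community structure.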

D. Datasets 

The datasets we used are described in Sec. I of the SI and the statistical features of the real
networks are summarised in Table S1.

E. SIR model and bond percolation

The Susceptible-Infected-Recovered (SIR) model [26] is usually used to mimic the spreading
processes of disease. Individuals in this model are classified into three states: susceptible
(S, does not carry the disease and will not infect others but can be infected), infected (I,
carries the disease and can infect others) and recovered (R, either dead or recovered from the
disease and immune to further infection). The simulation runs in discrete time steps. At each
time step, infected individuals transmit the disease to their neighbors with probability β
and recover with probability γ. Then the SIR transmissibility is p = β/γ. The process
stops when there are no infected nodes left.

The SIR model can be mapped to a bond percolation model where each link exists with
a probability equal to the SIR transmissibility p [11]. After removing the other edges, a
number of clusters are formed. It is clear that the ultimate size of the SIR epidemic outbreak
triggered by a single initially infected node is precisely the size of the percolation
cluster that the initial node belongs to. Consequently, the nodes in the same cluster are
expected to have the same coverage. A review article on epidemic processes in complex
networks can be found in Ref. [10].
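
A minimal discrete-time SIR simulation is sketched below to make the correspondence concrete.
It assumes the special case γ = 1, in which every infected node gets exactly one chance to
infect each susceptible neighbor with probability β before recovering, so that the
transmissibility is simply p = β; the function name and the toy network are hypothetical.

```python
import random


def sir_outbreak(adj, seeds, beta, rng):
    """Return the final fraction of recovered nodes (the spreadability).

    `adj` maps each node to the set of its neighbors. Recovery probability is
    fixed to 1 (an assumption of this sketch), so nodes are infectious for one step.
    """
    recovered = set()
    infected = set(seeds)
    while infected:  # stop when no infected node is left
        newly = set()
        for v in infected:
            for w in adj[v]:
                susceptible = w not in infected and w not in recovered and w not in newly
                if susceptible and rng.random() < beta:
                    newly.add(w)
        recovered |= infected
        infected = newly
    return len(recovered) / len(adj)


# Toy network: a path 0-1-2-3 plus an isolated node 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: set()}
full = sir_outbreak(adj, [0], 1.0, random.Random(0))  # every edge transmits
none = sir_outbreak(adj, [0], 0.0, random.Random(0))  # no edge transmits
```

With β = 1 the outbreak covers the whole connected component of the seed, and with β = 0 it
stays at the seed, matching the bond percolation clusters for p = 1 and p = 0 respectively.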